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Abstract 

Inferring the causal structure of a set of random variables from a finite sample 
of the joint distribution is an important problem in science. Recently, methods using 
additive noise models have been suggested to approach the case of continuous variables. 
In many situations, however, the variables of interest are discrete or even have only 
finitely many states. In this work we extend the notion of additive noise models to 
these cases. We prove that whenever the joint distribution p( x - y ) admits such a model 
in one direction, e.g. Y = f(X) + N, N i X, it does not admit the reversed model 
X = g(Y) + N, N A. Y as long as the model is chosen in a generic way. Based on these 
deliberations we propose an efficient new algorithm that is able to distinguish between 
cause and effect for a finite sample of discrete variables. In an extensive experimental 
study we show that this algorithm works both on synthetic and real data sets. 



1 Introduction 



Inferring causal relations by analyzing statistical dependences among observed random 
variables is a challenging task if no controlled randomized experiments are available. So- 



called constraint-based approaches to causal discovery (Pearl 2000 Spirtes et al. 19931 



select among all directed acyclic graphs (DAGs) those that satisfy the Markov condition 
and the faithfulness assumption, i.e., those for which the observed independences are 
imposed by the structure rather than being a result of specific choices of parameters of 
the Bayesian network. These approaches are unable to distinguish among causal DAGs 
that impose the same independences. In particular, it is impossible to distinguish between 
X — > Y and Y — > X. More recently, several methods have been suggested that use 
not only conditional independences, but also more sophisticated properties of the joint 
distribution. For simplicity, we explain the ideas for the two variable setting since this 



case is particularly challenging. Kano fc Shimizu| ( 2003 ) use models 



Y = f(X) + N 
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where / is a linear function and ./V is additive noise that is independent of the hypothetical 
cause X. This is an example for an additive noise model from X to Y. Apart from trivial 
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cases, P(X, Y) can only admit such a model from X to Y and from Y to X in the bivariate 
Gaussian case. |Hoyer et al. (20091 generalize the method to non-linear functions / and 
showed that generic models of this form generate joint distributions that do not admit 
such an additive noise model from Y to X. Zhang & Hyvarinen (2009 1 augment the model 
by applying a non-linear function g to the rhs of eq. ([!]) and still obtain identifiability for 
generic cases. Peters et al. (2009) use independent linear additive noise models in order to 



detect whether a sample of a time series has been reversed. Their positive results further 
support this way of causal reasoning. All these proposals, however, were only designed for 
real- valued variables X and Y. 

For discrete variables, Sun et al. (2008 ) propose a method to measure the complexity of 
causal models via a Hilbert space norm of the logarithm of conditional densities and prefer 
models that induce smaller norms. Sun et al. (2006) fit joint distributions of cause and 
effect with conditional densities whose logarithm is a second order polynomial (up to the 
log-partition function) and show that this often makes causal directions identifiable when 
some or all variables are discrete. For discrete variables, several Bayesian approaches 
(Heckerman et al. 1999) are also applicable, but the construction of good priors are 
challenging and often the latter are designed such that Markov equivalent DAGs still 
remain indistinguishable. 

Here, we extend the model in eq. ([TJ to the discrete case in two different ways: (I) 
If both X and Y take values in Z (the support may be finite, though) additive noise 
models can be defined analogously to the continuous case. (II) If both X and Y take 
only finitely many values we can still define additive noise models by interpreting the 
+ sign as an addition in the finite ring Z/mZ. We propose to apply this method to 
variables where the cyclic structure is appropriate (e.g., the direction of the wind after 
discretization, day of the year, season). However, the applicability of this second model 
class is not restricted to random variables that take integers as values: Assume that X and 
Y take values in A := {ai, . . . , a m } and B := {&i, . . . , bm}, which are structureless sets. 
Considering functions / : A — > B and models with P(Y = bj \ X = aj) = p if bj = /(aj) and 
(1 — p)/(m — 1) otherwise, is a special case of an additive noise model: Impose any cyclic 
structure on the data and use the additive noise P(iV = 0) = p, P(N = I) = (1— p)/(m— 1) 
for I ^ 0. This may be helpful whenever the random variables are categorical and when 
these categories do not inherit any kind of ordering (e.g. different treatments of organisms 
or phenotypes). 

In the following article we refer to (I) by saying integer constraint, whereas model (II) 
satisfies the cyclic constraint. 

The main idea of the causal inference method we propose goes as follows: If such an 
additive noise model exists in one direction but not in the other, we prefer the former 
based on Occam's Razor and infer it to be the causal direction. 

Such a procedure is only sensible if there are only few instances, in which there is an 
additive noise models in both directions. If, for example, all additive noise models from 
X to Y also allow an additive noise model from Y to X, we could not draw any causal 
conclusions at all. We will show that reversible cases are very rare and thereby answer 
this theoretical question. 

For a practical causal inference method we have to test whether the data admits an 
additive noise model and thus have to perform a discrete regression. But since in the 
discrete case regularization of the regression function is not necessary, in principle we 
would have to check all possible functions and test whether they result in independent 
residuals. This is highly intractable, of course, and we therefore propose an efficient 
heuristic procedure that proved to work very well in practice. 
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In section 2 we extend the concept of additive noise models to discrete random variables 
and show the corresponding identifiability results for generic cases in section 3. In section 
4 we introduce an efficient algorithm for causal inference on finite data, for which we show 
experimental results in section 5. We conclude in section 6. 



2 Additive Noise Models for Discrete Variables 



As it has been proposed for the continuous case by Shimizu et al. (20061; Hoyer et al. 



(2009); Zhang & Hyvarinen (2009) we assume the following causal principle to hold 



throughout the remainder of this article: 

Causal Inference Principle (for discrete random variables) Whenever Y sat- 
isfies an additive noise model with respect to X and not vice versa then X is a cause for 
Y, and we write X — ► Y. 



Janzing & Steudel (2009) give further theoretical support for this principle using the 



concept of Kolmogorov complexity and Peters et al. ( 2009 ) use this way of reasoning for 
detecting the arrow of time. 

Note that whenever there is no additive noise model in any direction (which may well 
happen) we do not draw any causal conclusions and other causal inference methods should 
be tried. 

We now precisely explain what we mean by an additive noise model in the case of 
discrete random variables. For simplicity we denote px{ x ) = ~P(X = x), Py{v) = P{Y = 
y), n{l) = P(iV = /) and h(k) = P(N = k) and suppX is defined as the set of all values 
that X takes with probability larger than 0: suppX := {k \ px{k) > 0}. 



2.1 Integer Constraint 

Assume that X and Y take values in Z (their distributions may have finite support). We 
say that there is an additive noise model (ANM) from X to Y if 

Y = f(X)+N, N±X 

where / : Z — > Z is an arbitrary function and N a noise variable that takes integers as 
values, too. 

Furthermore we require n(0) > n(j) for all j ^ 0. This does not restrict the model 
class, but is due to a freedom we have in choosing / and N: If Y = f{X) + N, N 1 X, 
then we can always construct a new function fj, such that Y = fj{X) + Nj, Nj A. X by 
choosing fj(i) = f{i) + j and Uj(i) = n{i + j). 

Such an additive noise model is called reversible if there is also an additive noise model 
from Y to X, i.e. if it satisfies an additive noise model in both directions. 



2.2 Cyclic Constraint 

We can extend additive noise models to random variables which inherit a cyclic structure 
and therefore take values in a periodic domain. Random variables are usually defined 
as measurable maps from a probability space into the real numbers. We thus make the 
following definition 

Definition 1 Let (D,,J-,P) be a probability space. A function X : f2 — ► Z/mZ is called 
an m-cyclic random variable if X~ l (k) G FMk G Z/mZ. All other concepts of probability 
theory (like distributions and expectations) can be constructed analogous to the well-known 
case, in which X takes values {0, . . . , m — 1}. 
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Let X and Y be m- and m-cyclic random variables. We say that Y satisfies an additive 
noise model from X to Y if there is a function / : Z/mZ — * Z/rhZ and an m-cyclic noise 
N such that 

Y = f(X) + N and N X X. 

Again we require n(0) > n{j) for all j / and call this model reversible if there is a 
function g : Z/mZ — ► Z/mZ and an m-cyclic noise N such that 

X = g(Y) + N and TV X Y. 

2.3 Relations 

The following two remarks are essential in order to understand the relationship between 
integer and cyclic constraints: 

(1) The difference between these two models manifests in the target domain. If we 
consider an ANM from X to Y it is important whether we put integer or cyclic constraints 
on Y (and thus on N). It does not make a difference, however, whether we consider 
the regressor X to be cyclic (with a cycle larger than the support of X) or not. The 
independence constraint remains the same. 

(2) In the finite case additive noise models with cyclic constraints are more general 
than the ones with integer constraints: Assume there is an ANM Y = f{X) + N, where 
all variables are taken to be non-cyclic and Y takes values between k and I, say. Then 
we still have an ANM Y = f{X) + TV" if we regard Y to be I — k + 1-cyclic because N 
mod (/ — k + 1) remains independent of X. It is possible, however, that N X^ X, but N 
mod (Z — k + 1) X X (see Example 2 in Section 3.2.1). 
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3 Identifiability 

Whether an additive noise model is allowed 
depends on the form of the joint distribution 
P( x ' y ). Let F be the subset of the set A of all 
possible joint distributions that allow an additive 
noise model from X to Y in the "forward direc- 
tion" , whereas B allows an additive noise model 
in the backward direction from Y to X. Some 
trivial examples like px(0) = l,n(0) = 1 and 
/(0) = 2 immediately show that there are joint distributions allowing additive noise mod- 
els in both directions, meaning FnB (see Figure[T]). But how large is this intersection? 
Our method would not be useful if we find out that F and B are almost the same sets. 
Then most additive noise models can be fit in either both directions or in none. For 
additive noise models with integer constraints and with cyclic constraints we identify the 
intersection B n F and show that it is indeed a very small set. If we are unlucky and 
the data generating process we consider happens to be in B n F, our method does not 
give wrong results, but answers "I do not know the answer". In all other situations the 
method identifies the correct direction given that we observe enough data. The proofs are 
provided in the appendix. 
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3.1 Integer Constraint 



3.1.1 Y or X has finite support 

First we assume that either the support of X or the support of Y is finite. This already 
covers the situation in most applications. Figure [2] (the dots indicate a probability greater 
than 0) shows an example of a joint distribution that allows an ANM from X to Y, but 
not from Y to X. This can be seen easily at the "corners" X = 1 and X = 7: Whatever 
we choose for g(0) and 5(4), the distribution of N \ Y = is supported only by one point, 
whereas N \ Y = 4 is supported by 3 points. Thus N cannot be independent of Y. 

Figure [3] shows a (rather non-generic) example that allows an ANM in both directions 
if we choose px(«i) = §g,Px(&i) = m for i = 1, . . . , 4 and p x {a%) = ^,Px(bi) = ^ for 
i = 5,...,8. 

We prove the following 



Theorem 2 An additive noise model X — > Y is reversible there exists a disjoint 
decomposition (jl=i Ci = suppX, such that 

• The C\s are shifted versions of each other 

V* 3di>0:Ci = C + di 

and f is piecewise constant: 

f \d= <k Vi. 

• The probability distributions on the CjS are shifted and scaled versions of each other 
with the same shift constant as above: For x G C L 

P ( * = ,) = P ( *=,-*).*£f|| 

holds. 

• The sets Ci + supp := {c{ + h : n(h) > 0} are disjoint. 

Obviously, such a decomposition for supp Y that satisfies the same criteria must exist, too 
(symmetry argument). We are now given a full characterization of all cases that allow 
an ANM in both directions. Even each of the conditions by itself is very restrictive, so 
that all conditions together describe a very small class of models: in almost all cases the 
direction of the model is identifiable. In Figure [3] all ctj belong to Co, all bj to C\ and 
di = 1. 

As for the other theorems of this section the proof is provided in the appendix. Its 
main point is based on the asymmetric effects of the "corners" of the joint distribution. 
In order to allow for an infinite support of X (or Y) the proof generalizes the concept of 
"corners" . 



3.1.2 X and Y have infinite support 

Theorem 3 Consider an additive noise model X — > Y where both X and Y have infinite 
support. We distinguish between two cases 

1. N has compact support: 3m, I £ Z, such that suppiV = [m, /]. 

Assume there is an ANM from X to Y and f does not have infinitely many infinite 
sets, on which it is constant. Then the model is reversible <J=^ there exists a disjoint 
decomposition Ul=i C« = suppX that satisfies the same conditions as in TheoremlA 
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Figure 2: This joint distribution 
satisfies an additive noise model 
only from XtoY. 



Figure 3: If we choose the parameters carefully this 
joint distribution allows additive noise models in 
both directions. (Thickness stands for probability 
values.) 



2. N has entire Z as support: P(N = k) > OVfc G Z. 

Suppose X and Y are not independent and there is an ANM X — > Y and Y — > X. If 
f, the distribution of N and all px(k) for all k > m for any m E Z are known, then 
all other values px(k) for k < m are determined. That means even only a small 
fraction of the parameters determine the remaining parameters. 

Note that the first case is again a complete characterization of all instances of a joint 
distribution, an ANM in both directions is conform with. The second case does not yield 
a complete characterization, but shows how restricted we are for a given function / and 
noise N in order to choose a distribution that yields a reversible ANM. 

3.2 Cyclic Constraint 

Again we investigate how often X and Y can both satisfy an additive noise model with 
respect to each other, this time considering cyclic constraints. Assume Y = f{X) + N 
with N JL X. We will show that in the generic case the model is not reversible, meaning 
there is no g and N, such that X = g(Y) + N with N _LY. 

Note that the model Y = f{X) + N is reversible if and only if there is a function g, 
such that 

p(x) ■ n(y - f(x)) = q(y) ■ h(x - g(y)) Vx, y , (2) 

where 

p[g(y) + a) ■ n(y - f(g(y) + a) J 
Q{y) = 2^P{x)n(y - /(£)) and h{a) = Vy : q{y) / 

3.2.1 Non-Identifiable Cases 

First, we give three examples of additive noise models that are not identifiable. This 
restricts the class of situations in which identifiability can be expected. Figure [4] shows 
instances of all three examples. 

Example 1 X and Y can be independent even though there is an ANM from X to Y: 

(i) IfY = f{X) + N and f{k) = const for all k : p{k) 7^ 0, then the model is reversible. 

(ii) IfY = f{X) + iV for a uniformly distributed noise N, then the model is reversible. 
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Proof, (i) It follows Y = N + const with probability one. Thus X and Y are independent 
and X = g(Y) + X for g = is a backward model, (ii) N := f{X) + N is still uniformly 
distributed and thus independent of X. We thus have again Y = N + 0. □ 

Example 2 If Y = f{X) + N for a bijective and affine f and uniformly distributed X , 
then the model is reversible. 

Proof. Since X is uniformly distributed and f{x) = ax + b is bijective, Y is uniform, too. 
For g(y) = f^ 1 (y) and h(k) = n(y — f(g(y) + k)) = n(b — f{k)) equation ^ is satisfied. 
□ 




Figure 4: The joint distributions allow additive noise models in both directions . They 
are instances of Examples l(i), 1 (ii) and 2 (from left to right). 



3.2.2 Identifiability Results 

Motivated by the counter examples we now make the assumptions that / is not constant 
(Example l(i)), N is not uniformly distributed (Example 1(h)) and that not both X and 
Y are uniformly distributed (Example 2). Without proof we state the conjecture that this 
is already enough to ensure identifiability meaning an ANM can only hold in one direction. 

Theorem 4 (Conjecture) Assume X,Y and N are random variables that are not uni- 
formly distributed and non- degenerate (that is they do not only take one value). Assume 
further that we have an additive noise model Y = f{X) + N, N A. X with non-constant 
f. Then there is no additive noise model from Y to X. 

In this work we prove a slightly weaker statement. Usually the distribution n(l) (similar 
for p{k)) is determined by m-1 free parameters. As long as the sum remains smaller than 
1, there are no (equality) constraints for the values of n(0), . . . , n(fh — 2). Only n{fh — 1) 
is determined by X]/=o ln (0 = ^ e snow that in the case of an reversible additive 
noise model the number of free parameters of the marginal n{l) is heavily reduced. The 
exact number of constraints depends on the backward model, but can be bounded from 
below by 2. Furthermore the proof shows that a dependence between values of p and n is 
introduced. Both of these constraints are considered to lead to non-generic models. That 
means for any generic choice of p and n we can only have an additive noise model in one 
direction. 

Note further that (#supp X ■ #supp N) is the number of points (x, y) that have probability 
greater than 0. It must be possible to distribute these points equally to all points from 
#supp Y in order to allow a backward additive noise model. Thus we have the necessary 
condition #suppF | (#suppX • #supp N). 
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Theorem 5 Assume Y = f{X) + N, N X X with non-uniform X (m-cyclic), Y (Tri- 
cyclic) and N (fh-cyclic) and non-constant f. 

(i) If #suppY /(^supp A~ • #suppiV) then there is no additive noise model from Y to 
X. 

(ii) Assume that #suppX = m, #suppiV = rh. If there is an additive noise model from 
Y to X, at least one additional equality constraint is introduced for the choice of 
either p or n. 

3.3 Mixed Constraints 

With the results developed in the last two section we can cover even models with mixed 



constraints (for the precise conditions of "usually" see Section 3.2): 

Y = f{X) + N, N X X, X cyclic, Y, N non-cyclic 
U Y = f(X) +N, N XX, X cyclic, Y, N cyclic 

Usually there is no ANM X = g(Y) +N, N ±Y, Y cyclic, X, N cyclic 
Usually there is no ANM X = g(Y) + N, N X Y, Y non-cyclic, X, N cyclic 

And, conversely: 

Y = f(X) + N, N ±X, X non-cyclic, Y, N cyclic 
U Y = f{X) + N, N ±X, X cyclic, Y, N cyclic 

Usually there is no ANM X = g(Y) + N, N X Y, Y cyclic, X, N cyclic 
U Usually there is no ANM X = g(Y) + N, N X Y, Y cyclic, X, N non-cyclic 

4 Practical Method for Causal Inference 

Based on our theoretical findings in Section [3] we propose the following method for causal 



inference (see Hoyer et al. (2009) for the continuous case): 



1. Given: iid data of the joint distribution p( x < Y \ 

2. Regression of the model Y = f{X) + iV leads to residuals N, 
regression of the model Y = f{X) + N leads to residuals N . 

3. If N X X and N X Y infer "X is causing Y", 
if N X X and N X Y infer "Y is causing X", 

if N X ^ an d N X Y infer "/ do not know (bad model fit)" and 

if iV" X X and N JLY infer "I do not know (both directions possible)". 

(The identifiability results show that in the generic case we will not find that both N JL X 
and N X Y.) This procedure requires discrete methods for regression and independence 
testing and we now discuss our choices. 
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4.1 Regression Method 

Given a finite number of iid samples of the joint distribution p(^> y ) we denote the sample 
distribution by ~p( x > Y \ In continuous regression we usually minimize a sum consisting of a 
loss function (like an ^-error) and a regularization term that prevents us from overfitting. 

Regularization of the regression function is at least in principle not necessary in the 
discrete case. Since we may observe many different values of Y for one specific X value 
there is no risk in overfitting. This introduces further difficulties compared to continuous 
regression since in principle we now have to try all possible functions from X to Y and 
compare the corresponding values of the loss function. 

Minimizing a loss function like an i p error is not appropriate for our purpose, either: 
after regression we evaluate the proposed function by checking the independence of the 
residuals. Thus we should choose the function that makes the residuals as independent 
as possible. Therefore we consider a dependence measure (DM) between residuals and 
regressor as loss function, which we denote by DM(iV, X). 

Two problems remain: 

(1) Assume the different X values x\ < . . . < x n occur in the sample distribution i*( x > Y \ 
Then one only has to evaluate the regression function on these values. More problematic 
is the range of the function. Since we can only deal with finite numbers, we have to restrict 
the range to a finite set. No matter how large we choose this set, it is always possible 
that the resulting function class does not contain the true function. But since we used 
the freedom of choosing an additive constant to require n(0) > n{k) and n(0) > n(k) 
for all k 7^ 0, we will always find a sample (Xj, Yi) with Y = f{Xi) if the sample size is 
large enough. Thus it would be reasonable to consider all Y values that occur together 
with X = x as a potential value for f(x). But to even further reduce the impact of this 
problem we regard all values between min Y and max Y as possible values for /. And 
if there are too few samples with X = Xj and the true value f(xj) is not included in 
{min Y, min Y + 1, . . . , maxY} we may not find the true function /, but the few "wrong" 
residuals do not have an impact on the independence. In practice the following deliberation 
is much more relevant: 

(2) Even if all values of the true function / are one of the m := #{minY, minY + 
l,...,maxY} considered values, the problem of checking all possible functions is not 
tractable: If n = 20 and m = 16 there are 16 20 = 2 80 possible functions (the amount 
of particles in the universe is estimated to be « 10 79 ). We thus propose the following 
heuristic but very efficient procedure (the experimental results will show that it works 
very reliably in practice): 

Start with an initial function f° that maps every value x to the y which occurred 
(together with this x) most often under all y. Iteratively we then update each function 
value separately. Keeping all other function values f(x) with i / x fixed we choose f{x) 
to be the value that results in the "most independent" residuals. This is done for all x 
and repeated up to J times as shown in Algorithm 1. Recall that we required n(0) > n(k) 
for all k. 

In the algorithm, fx{i^y(X) means that we use the current version of but change 

the function value f{xi) to be y. If the argmax in the initialization step is not unique we 
take the largest possible y. We can even accelerate the iteration step if we do not consider 
all possible values {minY, . . . ,maxY}, but only the five that give the highest values of 
P(X = xi, Y = y) instead. 
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Algorithm 1 Discrete Regression with Dependence Minimization 



1: Input: P(X,Y) 

2: Output: / 

3: p°'(xi) : = argmax y P(X = Xi,Y = y) 

4: repeat 

5: j=j + l 

6: for i in a random ordering do 

7: f^(xi) := argmin y BM(X, Y - f^(X)) 

8: end for 

9: until residuals N := Y — f J (X) are independent of X or j = J. 



4.2 Independence Test and Dependence Measure 



Assume we are given joint iid samples (W{,Zi) of the discrete variables W and Z and 
we want to test whether W and Z are independent. In our implementation we only use 



Person's x test ( e -g- E. L. Lehmann (2005)), which is most commonly used. It computes 



the difference between observed frequencies and expected frequencies in the contingency 
table. The test statistic is known to converge towards a x 2 distribution, which is taken 
as an approximation even in the finite sample case. For very few samples, however, this 



approximation and therefore the test will usually fail. It has been suggested (e.g. RProject 
(2009)) that instead of a x 2 test, Fisher's exact test (E. L. Lehmann 2005) could be used 



if not more than 80% of the expected counts are larger than 5 ("Cochran's condition"). 

For a dependence measure we simply use the p-value of the independence test or the 
test statistic if the p-value is too small (in a computer system the p-value is sometimes 
regarded to be zero). 



5 Experiments 
5.1 Simulated Data. 

We first investigate the performance of our method on synthetic data sets. Therefore 
we simulate data from additive noise models and check whether the method is able to 
rediscover the true model. We showed in Section [3] that only very few examples allow a 
reversible ANM. 

Data sets la and lb support these theoretical results. We simulate a large amount of 
data (1000-2000 data points) from many randomly chosen models. All models that allow 
an ANM in both directions are instances of our examples from above (without exception) . 

Data sets 2a and 2b show how well our method performs for small data size and models 
that are close to non-identifiable. 

Data set 3a investigates empirically the run-time performance of our regression method 
and compares it with a brute-force search. 

Data set 3b shows that the method does not favor one direction if the supports of X 
and Y are of different size. 

5.1.1 Integer Constraints 
Data set la (identifiability). 

With equal probability we sample from a model with (1) supp X C {1, . . . , 4}, (2) supp X C 
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{1, . . . ,6}, (3) X binomial with parameters (n,p), (4) X geometric with parameter p, (5) 
X hypergeometric with parameters (M,K,N), (6) X negative binomial with parameters 
(n,p) or (7) X Poisson with parameter A. The parameters of these distributions and the 
function (with values between —7 and 7) are also randomly chosen. We then consider 1000 
different models. For each model we sample 1000 data points and apply our algorithm 
with a = 0.05. 

The results given in Table [T] show that the methods works well on almost all simulated 
data sets. The algorithm outputs "bad fit in both directions" in roughly 5% of all cases, 
which corresponds to the chosen test level. The model is non-identifiable only in very 
few cases, which are shown in Table [T] All of these cases are instances of the counter 
examples from above. This experiment further supports our proposition that the model 
is identifiable in the generic case. 



=ff samples 


correct dir. 


wrong dir. 


"both dir. poss." 


"bad fit in both dir." 


total 


89.9% 


0% 


5.3% 


4.8% 


non-overlapping noise 






3.0% 




/ constant 






2.3% 





Table 1: Data Set la. The algorithm identifies the true causal direction in almost all 
cases. All models that were classified as reversible are either instances, where the noise 
does not "overlap" (i.e. f(X) +suppiV are disjoint) or where / is constant. For the 
remaining models the algorithm mistakes the residuals as being dependent in 4.8% of the 
cases, which corresponds to the test level of a = 5%. 



Data set 2a (close to non-identifiable). 

For this data set we sample from the model Y = f{X) + N with 

n(-2) = 0.2, n(0) = 0.5, n(2) = 0.3 and /(-3) = /(l) = 1, /(-l) = /(3) = 2. 
Depending on the parameter r we sample X from 

p x (-3) = 0.1 + ~ px(-l) = 0.3 - r -, Px (l) = 0.15 - ~, Px (3) = 0.45 + ~. 

For each value of the parameter r ranging between —0.2 < r < 0.2 we use 100 different 
samples, each of which has the size 400. 

In Theorem [2] we proved that the ANM is reversible if and only if r = 0. Figure [5] shows 
that the algorithm identifies the correct direction for Again, the test level of a = 5% 

introduces indecisiveness of roughly the same size, which can be seen for |r| > 0.15. The 
number of such cases can be reduced by decreasing a (but would lead to some wrongly 
accepted backward models, too). 

Data set 3a (fast regression). 
The space of all functions from the domain of X to the domain of Y is growing very 
quickly in their sizes: If #suppX = m and ^suppY = m then the space J- := {/ : 
suppX — ► suppY} has rh m elements. If one of the variables has infinite support the 
set is even infinitely large (although this does not happen for any finite data set). It 
is clear that it is infeasible to optimize the regression criterion by trying every single 
function. As mentioned before one can argue that with high probability it is enough to 
only check the functions that correspond to an empirical mass that is greater than (again 
assuming n(0) > 0): It is likely that P(X = —2,Y = /(— 2)) > 0. We call these functions 
"empirically supported". But even this approach is often infeasible. In this experiment 
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Figure 5: Figure 1: Data set 2a. Proportion of correct and false results of the algorithm 
depending on the distribution of N. The model is not identifiable for r = 0. If r differs 
significantly from almost all decisions are correct. 
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Figure 6: Data set 3a. The size of the whole function space, the number of all functions 
with empirical support and the number of functions checked by our algorithm is shown 
for N\ (left) and iV2 (right). An extensive search would be intractable in these cases. The 
algorithm we propose is very efficient and still finds always the correct function for all 
data sets. 



we compare the number of possible functions (with values between min Y and max Y), the 
number of empirically supported functions and the number of functions the algorithm we 
proposed in Section 4.1 check in order to find the true function (which it always did). 

We simulated from the model Y = round(0.5 • X 2 ) + TV" for two different noise distri- 
butions: 

m(-2) = m(2) = 0.05, ni(-i) = m(0) = m(i) = 0.3 

and 



n 2 (-3) = n 2 (3) = 0.05, n 2 (-2) = n 2 (-l) = n 2 (0) = n 2 (l) = n 2 (2) = 0.18 

Each time we simulated a uniformly distributed X with i values between — and l - = ^- 
for ? = 3,5,7,..., 19. For each noise/regressor distribution we simulated 100 data sets. 

For JVi and i = 9, for example, there are (11 — (— 2)) 9 ~ 1.1 • 10 10 possible functions 
in total and 5 9 ~ 2.0 • 10 6 functions with positive empirical support. Our method only 
checked 104 ± 33 functions before termination. The full results are shown in Figure |6j 
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5.1.2 Cyclic Constraints 
Data set lb (identifiability). 

For the three combinations (m, rh) £ {(3, 3), (3, 5), (5, 3)} we consider 1000 different models 
each: We randomly choose a function / ^ const and distributions for p and N. For each 
model we sample 2000 data points and apply our algorithm with a = 0.05. 

The results given in Table [2] show that the method works well on almost all simulated 
data sets. The algorithm outputs "bad fit in both directions" in roughly 5% of all cases, 
which corresponds to the chosen test level. The model is non-identifiable only in very few 
cases, which are shown in Table |3j All of these cases are instances of the counter examples 
from above. This experiment further supports our theoretical result that the model is 
identifiable in the generic case. 



TO 


m 


correct dir. 


wrong dir. 


"both dir. poss." 


"bad fit in both dir." 


3 


3 


95.3% 


0% 


0.8% 


3.9% 


3 


5 


94.8% 


0% 


0.0% 


5.2% 


5 


3 


95.5% 


0% 


0.3% 


4.2% 



Table 2: Data Set lb. The algorithm identifies the true causal direction in almost all 
cases. In a proportion corresponding to the test level (~ 5%), the algorithm mistakes the 
residuals as being dependent and thus does not find the correct model. Only in very few 
cases a model can be fit in both directions, which supports the results of section 3. 



function / 


p(l),...,p(m) 


n(l), . . . , n{m) 


Ex. 


h-> 0, 1 h-> 1, 2 i->- 


0.27, 0.05, 0.69 


0.30,0.33,0.37 


l(i) 


i * 0, 1 i * 2, 2 i ► 


0.83, 0.00, 0.17 


0.15,0.26,0.58 


l(i) 


i * 0, 1 i * 0, 2 i ► 2 


0.40, 0.60, 0.00 


0.37,0.41,0.22 


l(i) 


0^2,1^0,2^2 


0.34, 0.53, 0.14 


0.33,0.34,0.33 


l(ii) 


0^0,1^0,2^2 


0.17, 0.74, 0.09 


0.32,0.33,0.35 


l(ii) 


0^1,1^0,2^1 


0.38, 0.42, 0.20 


0.33,0.34,0.33 


1(H) 


0^1,1^2,2^2 


0.02, 0.86, 0.12 


0.35,0.30,0.35 


1(H) 


0^2,1^1,2^0 


0.33, 0.33, 0.34 


0.85,0.14,0.02 


2 


0i->1,1i->0,2i->1,3i->0,4i->0 


0.20, 0.47, 0.14, 0.08, 0.12 


0.33,0.33,0.34 


1(H) 


0^1,1^0,2^1,3^1,4^1 


0.55, 0.01, 0.03, 0.26, 0.14 


0.37,0.32,0.31 


l(i) 


i-> 0, 1 i-> 1, 2 i-> 0, 3 i-> 1, 4 i-> 2 


0.03, 0.71, 0.06, 0.10, 0.32 


0.32,0.34,0.34 


1(H) 



Table 3: Data Set lb. This table shows the cases, where both directions were possible. 
Without exception they are instances of the examples given in section 3. 

Data set 2b (close to non- identifiable). 

For this data set let m = m = 4 and / = id. The distribution of p is given by: p(0) = 
0.6, p(l) = 0.1, p{2) = 0.1, p(3) = 0.2. Depending on the parameter | < r < 0.8 we sample 
the noise N from the distribution n(0) = n(l) = r/2,n(2) = n(3) = 1/2 — r/2. That 
means N is uniformly distributed if and only if r = ^ . 

Example 1 and the fact that n(0) ^ n(2) (see proof of Proposition [7| show that the 
model is not identifiable if and only if the noise distribution is uniform, i.e. if and only if 
r = I • The further r is away from \ , the more the noise differs from a uniform distribution 
and the easier it should be for our method to detect the true direction. For each value of 
the parameter r we use 100 different samples, each of which has size 200. Figure [7] shows 
the results. 
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For r = 0.58 and r = 0.68 (indicated by the arrows in Figure [7]) we further investigate 
the dependence on the data size. Clearly, r = 0.58 results in a model that is still very 
close to non-identifiability and thus we need more data to perform well (see Figure [8| . 
Note that non-identifiable models lead to very few, but not to wrong decisions. 
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Figure 7: Data set 2b. Proportion of cor- 
rect results of the algorithm depending on 
the distribution of ./V (test level 1%). The 
model is not identifiable for r = 0.5. If r 
differs significantly from 0.5 the algorithm 
makes a decisions in almost all cases. 



Figure 8: For r = 0.58 (top) and r = 
0.68 (bottom) the performance depending 
on the data size is shown. More data is 
needed if the true model is close to non- 
identifiable (top). In both cases the per- 
formance clearly increases with the sample 
size. 



Data set 3b (no direction is favored a priori). 

This experiment shows that our method does not favor one direction if the supports 
of the two random variables are very unequal in size. We choose two examples, where 
m ■= #X := #suppX = 2 and rh := #y : = #suppy = 10. Since there are 2 10 = 1024 
function from y to X, but only 10 2 = 100 functions from X to y one could expect the 
method to favor models from Y to X. We show that this is not the case. 

For m 7^ rh € {2, 10} and m ^ rh E {3,20} we randomly choose distributions for X 
and N and a function / and sampled 500 data points from this model. Table [4] shows 
that the algorithm detects the true direction in almost all cases (except if the model is 
non-identifiable) . 



m 


rh 


correct dir. 


wrong dir. 


"both dir. poss." 


"bad fit in both dir." 


2 


10 


97.4% 


0% 


2.5% 


0.1% 


10 


2 


85.2% 


0% 


14.8% 


0.0% 


3 


20 


96.8% 


0% 


1.6% 


1.6% 


20 


3 


95.5% 


0% 


4.4% 


0.1% 



Table 4: Data Set 3b. The algorithm identifies the true causal direction in almost all 
cases. There is no evidence that the algorithm always favors one direction. 



5.2 Real Data. 



Data set 4 (abalone). 

We also applied our method to the abalone data set (Nash et al. 1994 1 from the UCI 



Machine Learning Repository (Asuncion & Newman 2007 1 . We tested the sex X of the 
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abalone (male (1), female (2) or infant (0)) against length Y\, diameter Y<2 and height 
Y3, which are all measured in mm, and have 70,57 and 28 different values, respectively. 
Compared to the number of samples (up to 4177) we treat this data as being discrete. 
Because we do not have information about the underlying continuous length we have to 
assume that the data structure has not been destroyed by the user-specific discretization. 
We regard X — > Y%, X — > Y2 and X — > I3 as being the ground truth, since the sex is 
probably causing the size of the abalone, but not vice versa. 

Clearly, the Y variables do not have a cyclic structure. For the sex variable, however, 
the most natural model would be a structureless set which is contained in the cyclic 
constraints; for comparison we try both models for X. Our method with integer constraints 
is able to identify all 3 directions correctly. Since X may be cyclic we also try to fit an 
ANM from Y to X with the cyclic constraints. Again, these models are rejected (see Table 
[5] and Figure [9]). We used a = 5% and the first 1000 samples of the data set. 

As mentioned earlier it is not surprising that we would accept an ANM from X to 
Y even if we put cyclic constraints on Y (which are certainly violated for this data set). 
We would obtain the following p-values: p-v&luex-^Yi = 0.17,p-valuex^n = 0.19 and 
p-valuex^Yi = 0.05. 





p-vahiex->Y 


estimated function 


p-vahicy^x (non-cyclic) 


p-valuey„+x (cyclic) 




0.17 


1 — ► 39, 1 1 — ► 51, 2 1 — > 53 


3 • 10" i4 


3 • 10-' 2 


Y 2 


0.19 


1 — ► 30, 1 1 — ► 41, 2 1 — ► 43 


2 • 10" i4 


4 • lO" 3 


Y 3 


0.05 


1 > 10, 1 ' ^ 14, 2 1 — ^ 15 





1 • io-» 



Table 5: Data Set 4. The algorithm identifies the true causal direction in all 3 cases. Here, 
we assumed a non-cyclic structure on Y and tried both cyclic and non-cyclic for X. 
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O £Z 

1 s> 0.1 

V, z 0.05 
^° 
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X=1 



X=2 



Y=1 Y=3. 



Figure 9: Data Set 4. For Y3 regressing on X (left) and vice versa (right) the plot shows the 
conditional distribution of the fitted noise given the regressor. If the noise is independent, 
then the distribution must not depend on the regressor state. Clearly, this is only the case 
for X — > I3 (left), which corresponds to the ground truth. 



For this data set the method proposed by (Sun et al. 2006 1 returns a slightly higher 



likelihood for the true causal directions than for the false directions, but this difference is 
so small, that the algorithm does not consider it to be significant. 

The abalone data set also shows that working with p-values requires some carefulness. 
For synthetic data sets that we simulate from one fixed model the p-values do not depend 
on the data size. In real world data, however, this often is the case. If the data generating 
process does not exactly follow the model we assume, but is reasonable close to it, we get 
good fits for moderate data sizes. Only including more and more data reveals the small 
difference between process and model, which therefore leads to small p-values. Figure 10 
shows how the p-values vary if we include the first n data points of the abalone data set 
(in total: 4177). One can see that although the p-values for the correct direction decrease 
they are clearly preferable to the p-values of the wrong direction. This is a well-known 
problem in applied statistics that also has to be considered using our method. 
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number of samples number of samples number of samples 



Figure 10: Data Set 4. The plots show p-values of forward and backward direction de- 
pending on the number of samples we included (no data point means p = 0). The p- value 
in the correct direction is eventually lower than any reasonable threshold. Nevertheless 
we prefer this direction since it is decreasing much more slowly than p-backward. 



Data set 5 (temperature). 

We further applied our method to a data set consisting of 9162 daily values of temper- 
ature measured in Furtwangen (Germany) (Janzing available soon I using the variables 



temperature (T, in °C) and month (M). Clearly M inherits a cyclic structure, whereas 
T does not. Since the month indicates the position of the earth relatively to the sun, 
which is surely causing the temperature on earth, we take M — > T as the ground truth. 
Here, we aggregate states and use months instead of days. This is done in order to meet 
Cochran's condition and get reliable results from the independence test; it is not a scaling 
problem of our method (if we do not aggregate the method returns pdays^T = 0.9327 and 

PT^days = 1-0000). 

For 1000 data points both directions are rejected (p-valueM->T = 2e— 4, p-valueT->M = 
le— 13). Figure 11 shows, however, that again the p- values m->t ar e decreasing much slower 
than p-values^— >m thus using other criteria than simple p-values we still may prefer this 
direction and propose it as the true one. 




2000 3000 4000 5000 
number of samples 



Figure 11: Data Set 5. The plots show p- values of forward and backward direction depend- 
ing on the number of samples we included (no data point means p = 0). Again we prefer 
the correct direction since the p- values are decreasing much more slowly than p-backward. 



6 Conclusions and Future Work 

We proposed a method that is able to infer the cause-effect relationship between two dis- 
crete random variables. We proved that for generic choices the direction of a discrete 
additive noise model is identifiable in the population case and we developed an efficient 
algorithm that is able to infer the causal relationship between two variables for a finite 
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amount of data. Since it is known that \ 2 fails for small data sizes, changing the indepen- 
dence test for those cases may lead to an even higher performance of the algorithm. 

Our method can be generalized in two directions: (1) handling more than two vari- 
ables is straightforward from a practical point of view (although one may have to introduce 
regularization to make the regression computationally feasible) and (2) it should be inves- 
tigated how our procedure can be applied to the case, where one variable is discrete and 
the other continuous. Corresponding identifiability results remain to be shown. 

In future work additive noise models should be tested on more real world data sets 
in order to support (or disprove) additive noise models as a principle in causal inference. 
Furthermore we hope that more fundamental and general principles for identifying causal 
relationships will be developed that cover additive noise models as a special case. Nev- 
ertheless we regard our work as an important step towards understanding the difference 
between cause and effect. 

Appendix 

A Proof of Theorem [2] 
Proof. 

=>: First we assume supp Y = {yo, . . . ,y m } with yo < Vi < • • • < Vm- Define the non- 
empty sets Ci := suppX|y = yj, for i = 1, . . . , m. That means Ci, . . . , C m C supp X 
are the smallest sets satisfying P(A 6 Cj | Y = yt) = 1. For all i,j it follows that 

Ci = Cj or Ci n Cj = and / = &i = const. (3) 

This is proved by an induction argument: 

Base step: At first consider C m corresponding to the largest value y m of suppY. 
Assuming f(x\) < f(x2) for x\,X2 £ C m leads to 

Vm = f(xi) + iV max < f(x 2 ) + iVmax = 2/m 
and therefore to a contradiction. 

Induction step: Now consider Ck and assume properties ^ are satisfied for all C^ 
with k < k < m. 

xeC k nC~ k 

P(N = y k - f(x)) = P(N = y k - f(x)) > Vx £ Cy, 
^C~ k cC k 

=4> C k = C k (since supp iV must have the same size for all y) 
^ f \c k = f \c- k = const 
Furthermore 

C k n C k = Vk<l<m 
=4> f U = const 

using the same argument as for C m . 

Thus we can choose some sets Co, . . . , C\ from Co, ... , C m , where I < m, such that 
Co, . . . , C\ are disjoint, and c k := f(C k ) are pairwise different values. Wlog assume 
Co = Co- Further, even the sets 

c k + supp N := {c k + h : P(N = h) > 0} 



17 



are pairwise different: If y\ = c^ + h± = c\ + hi then C& C Q and Q C Cj, which 
implies k = I. 

Now we consider the other case, namely that X has finite support. Then we define 
Co, . . . , C\ to be disjoint sets, such that / is constant on each of them: a := 
This time, it does not matter which of these sets is called Co- Since 

Cfc + supp iV = supp Y\X G Cfc 

we can use the same argumentation we used for Cj (exchange roles of X and V) and 
deduce again that the sets c/% + supp N are disjoint. 

The rest of the proof is valid for both cases (either X or Y has finite support): 
Consider C% for any i. According to the assumption that an additive noise model 
Y — > X holds we have 

n\y = c = JV|y = a 

& X - g(co)\Y = c = X - g(ci)\Y = a 

=¥■ X + di\Y = c = X\Y = a 
with di = g(ci) — g(co)- Thus Cj = Co + (including do = 0). 
For x £ Ci (which implies /(x) = Cj) we have 

P(X = x) P(X = x)P(7V = ci - /(x)) 

P(X G C) " £ ieCl P(X = x)P(iV = ci - f(x)) 

= P(x= ^:^- /(g)) =p(x = a |y = c) 

_ P(A = x - d t )P(N = cq - fjx - dj)) = P(X = x- dj) 
Ezec P(* = *)P(W = - /(x)) P(X G Co) 

In order to show that we have an reversible ANM we define the function g as follows 

g(y) = Vy G c + supp N 

g(y) = di Vy G Cj + supp N, i > 

The noise N is determined by the joint distribution p( x ' Y \ of course. It remains to 
check, whether the distribution of N\Y = y is independent of y. Consider a fixed y 
and choose i such that y G Cj+supp iV. Since Cj = Co+dj the condition g(y) + h G Cj 
is satisfied for all h G Co and therefore independently of y and Cj. If y(y) + h ^ Cj 
then P(iV = /i I y = y) = 0. And if g(y) + h £ d we have 

P(A = y(y)+/i,y = y) 



P(N = h\Y = y) = 



p(y = y) 

P(X = g(y)+h,N = y-f(g(y) + h)) 
P(Y = y) 
P(X = g(y) + h)P(N = y-c t ) 
Z ie c^(X = x)P(N = y-f(x)) 
PjX = g{y) + h) = PjX = g{y) + h - dj) 

P(X G Ci) ' ~ P(X G C ) 

P(X = h) 
PjX G C ) 
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which does not depend on y. 



□ 



B Proof of Theorem |3] 
Proof. 

1. P(N = k)> OVm < k < I and P(N = k) = 0, else. 



Y 



f(x 2 ) + N n 
f(xi)+N n 
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Assume that there is an ANM in both directions X — » V and Y — > X. As 
mentioned above we have a freedom of choosing an additive constant for the 
regression function. In the remainder of this proof we require P{N = k) = 
P(JV = k) = V7c < and P(N = 0),P(N = 0) > 0. The largest k, such that 
P(iV = A;) > will be called -/V max . In analogy to the proof above we define 
C y := suppX|Y" = y for all y G suppF. 

At first we note that all C y are shifted versions of each other (since there is a 
backward ANM) and additionally, they are finite sets. Otherwise this would 
contradict the assumptions because of the compact support of N. 
Start with any arbitrary x\ = min{/ _1 (/(xi))} and define 

xi := min {x G C f{xi)+Nn ^ \ / -1 (/(a;i))} 

This implies f(x±) > f(x±) and x\ G Cf(% x y 

If such a xi does not exist because the set on the right hand side 
is empty, then it cannot exist for any choice of x\\ It is clear that 



C 



f(xi)+N n 



f 1 {fi x i)) an d then we consider the first C 



f{xi)+N m ^+i 



that is not empty. Then this set must be f~ 1 (f(xi)) for some x\. This 
leads to an iterative procedure and to the required decomposition of 
suppX. 

We have that either max{/ _1 (/(xi))} > max{/ _1 (/(xi))} or min{/~ 1 (/(xi))} < 
min{/ _1 (/(xi))}: Otherwise C/( £l ) and C^^uj satisfy 

maxC /(£l )_ 1 > maxC /(£l ) 
minC /(£l) _ 1 < minC /(£l) 

Because of x\ G Cf^ x yx\ ^ Cf^^_i this contradicts the existence of an back- 
ward additive noise model. Wlog we therefore assume max{/ _1 (/(xi))} > 
max{/ _1 (/(a;i))}. Then we even have x\ > xi, 



xi 



mm 



i C f(xi 



)+N n 
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and 

xi = min{C /(:El)+Wmax+1 } 

(Otherwise we can use the same argument as above with Cf( X i)+N max an< ^ 
C /(*i)+JW+lO Define further 

x 2 := minrH/Ori) + ^max + 1) 

Since .TH/Oi)) C C f ^ +Naaxi but / -1 (/(a:i)) n C /(xi)+A r max+1 = 0, such a 
value must exist. Again, we can define x 2 in the same way as above. 
Set yi := f(x\) + -/V max and yi := f(x\) + 2 • iV max and consider the finite 
box from (min C yi , yi ) to (max C V2 , 2/2) • This box contains all the support from 
X I Y = f(xi) + N max +i, where i = 1, . . . , iV miH . Assume we know the positions 
in this box, where p( x > Y ^ is greater than zero. Then this box determines the 
support of X I Y = f(x±) + 2 • iV max + 1 (the line above the box) just using the 
support of iV and N. Iterating gives us the whole support of p(^' y ) in the box 
above (from y = f(x 2 ) + -/V" max to y = f(x 2 ) + 2 • iV max ). Since the width of the 
boxes are bounded by 3 • m&xCf( Xl ) — minCj( Xl ), for example, at some point 
the box of x n must have the same support as the one of x\. Figure [T] shows an 
example, in which n = 2. Using the whole distributions of N and iV we can 
now determine a factor a with 

P(X = Xl ,Y = /(xi) + AW) = a • P(X = x n ,Y = f(x n ) + AW) 

over the sequence (x±,xi,x 2 , x 2 , . . . , x n ). But since we computed the boxes in 
a deterministic way, the same a satisfies 

P{X = x n ,Y = f{x n ) + AW) = a ■ P{X = x 2n -i,Y = f(x 2n -i) + AW) 

and therefore 

P{X = xi,Y = /(xi)+AW) = a k -P{X = x (k+1)n _ k ,Y = f{x (k+1)r ^ k )+N max ) 

Note that a corresponding equation with the same constant a holds for the 
opposite direction. This leads to a contradiction, since there is no probability 
distribution for X with infinite support that can fulfill this condition (no matter 
if a is greater, equal or smaller than 1). 

-4=: This direction is proved in exactly the same way as in Theorem [2] 

2. P(N = k)> OVA; € Z. 

Since X and Y are dependent there are y± and y 2 , such that (7(2/1) / 9(2/2)- Compar- 
ing P(X = k,Y = yi) and P(X = k,Y = 2/2) for k > m, we can identify the differ- 
ence d := 0(2/2) -g(yi)- If d < Owe can use ^x^m-^Y-yi^ and = m,Y = y 2 ) 
in order to determine P(X = m — 1, Y = y 2 ). If d > we use p ^f~ m+rf ~ l 'Y ~ v ? ' and 

v ' P(X=m+d,Y=y2) 

P(X = m,Y = y\) in order to determine P(X = m — 1, Y = yi). In both cases this 
yields P(X = m — 1) and with an iterative procedure all P(X = x). 

□ 
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C Proof of Theorem [5] 

regarding (i): Each distribution Y\X = Xj has to have the same support (up to an 
additive shift) and thus the same number of elements with probability greater than 0: 

#supp X ■ #supp N = k ■ #supp Y 

For (ii) we now consider 3 different cases and show necessary conditions for reversibility 
each. 

Case 1: / and g are bijective. 

Proposition 6 Assume Y = f(X) + N, N X X for bijective f and n(l) ^ 0,p(A;) ^ 
OVfc, I. If the model is reversible with a bijective g, then X and Y are uniformly 
distributed. 

Proof. Since g is bijective we have that Vy3t y : g{t y ) = g{y) — 1. It follows from 

n(y - f(x + l))p(x + 1) _ n(x + 1 - g(y))g(y) 
n(t y - f(x))p(x) n(x + 1 - g(y))q(t y ) 

This implies 

P(x + 1) = n{t y -f(x))q(y) ^ 1 = p(x + m) = Uk=o Mfr - f(x + k))q(y) m 
P{x) ~ n(y-f(x + l))q(t y ) " p(x) " n (y - f(x + k + l))q(t y ) 

Since / is bijective it follows that q{y) = q{t y ). This holds for all y and thus Y and 
X are uniformly distributed. □ 



Case 2: g is not injective. 

Assume g(yo) = g(yi). From ^ it follows that 

ni : y0 ~ KX) } = 44 V* and "fo-fl*)) = n (V0-fW) y x ^ (4) 
n{yi - f{x)) q{yi) n(y 1 - f\x)) n(y 1 - f(x)) 

which imply equality constraints on n. To determine the number of constraints 
we define a function that maps the arguments of the numerator to those of the 
denominator 

Im(y - /) -> Z/fhZ 
">yom,f- yo -f( x ) i ^ yi -/(x) • 

We say /i has a cycle if there is a z £ N, s.t. h k {a) = (ho. . .oh) (a) G Im(yo — /) Vfc < z 
and fo z (a) = a. For example: 2i-^>4i-^6i-^>0^2. 

Proposition 7 Assume Y = f(X) + N, N ± X and n(l) / 0,p(/s) ^ OV/c,/. 
Assume further that the model is reversible with a non-injective g. 

• If h has only cycles, Im/ — ^cycles + 1 parameters of n are fixed. 

• Otherwise Im/ — ^cycles parameters of n are fixed. 
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Proof. Assume h has a cycle, then = 1 and n(yo — fi x )) = ~~ fi x )) ^ x - 
For any non-cyclic structure of length r (e.g. 3 i— » 5 i—* 7 and 7 ^ Im(yo — /))> 
r — 1 values of n are determined, but for cycles of length r (e.g. \—* 2 i— > 4 i— > 
6 i— ► 0) we get only r — 2 independent equations. Together with the normalization 
these are Imf — ^cycles (equality) constraints. If h has no cycle, we have Imf — 1 
independent equations plus the sum constraint. E.g.: = = ^© implies 

"(4) = n(6)^g and „(2) = 5$ . 
Further, 

wQ/o - f(x)) = q(y ) = ^p(x)n(y - f{x) 
n(yi - f(x)) q(yi) - /(»)) 

introduces a functional relationship between p and n. □ 

Note that if fh does not have any divisors, there are no cycles and thus #Im/ 
parameters of n are determined. 

Corollary 8 In cases i/ie number of fixed parameters is lower bounded by max( [1 /2- 
Im/],2)>2. 

Case 3: / is not injective. 

Assume /(xq) = f(xi). In a slight abuse of notation we write 

Z/mZ x Z/mZ — Z/mZ 

5-5 ' g(y)-g(y) ' 

Similar as above, we define 

Im(ac - (g ~ 9)) Z/mZ 
xo^Lff ■ zo - 5(2/) + s(&0 si - #(y) + s(z7) " 

We say that /i has a cycle if there is a 2 G N, s.t. h k (a) = (h o . . . o /i)(a) G 
Im(xo — {g — ff))Vfc < z and fo 2 (a) = a. 

Proposition 9 Assume Y = f{X) + N, N JL X , f is not injective and n(l) 7^ 
0,p(fc) 7^ OVA;, I. Assume further that the model is reversible for a function g. 

• If h has only cycles, lm(g — g) — ^cycles + 1 parameters of p are fixed. 

• Otherwise Im(g — g) — ^cycles parameters of p are fixed. 

Proof. From Q it follows that 

p(x ) _ h{x Q - g{y)) _ p( x o ~ 9(v) + 9®)) -n(y-f(x - g{y) + g(y))J 
p( Xl ) n{ Xl - g(y)) p fa _ g{y) + g ^ . n ^ _ _ g{y) + g ^ 

The rest follows analogously to the proof of Proposition [7j □ 

If (x\ — xq) does not divide m, there are no cycles and thus Im(<7 — g) parameters of 
p are determined. 

Corollary 10 In all cases the number of fixed parameters is lower bounded by 
max(ri/2-Im(s-s)l,2) > 2. 

Note that these three cases are sufficient since / and g injective implies n = m and / and 
g bijective. 



Vy,y 
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